Feeding data-hungry LLMs without risking privacy
Posted: December 13, 2024
Saying that a Large Language Model (LLM) has an appetite for data is a little like saying that Mount Everest is a hill. An LLM is a type of Artificial Intelligence (AI) that can “understand” and generate human language and perform tasks based on it.
This incredibly useful AI can take a written or spoken command in ordinary human language and then translate text, perform tasks as the command directs, and write content.
LLMs have even expanded beyond working only with text to handling other types of tasks. This means that LLMs have the potential to democratize computing, allowing non-programmers to leverage the benefits of powerful technology directly.
The privacy problems that LLMs create are two-fold. First, LLMs require training, and that training process presents issues related to personal data. Second, LLMs operate on data sets, and that process can bump up against privacy concerns as well.
LLM training privacy concerns
All things being equal, the more written language an LLM can run through, the better the outcome. Given that the public internet is an almost endless pool of words, LLM developers may use web bots (web-scraping technology) to pull down any words accessible on the internet, including information about individual human beings that, in some cases, the developers have no right to use. Additionally, once an LLM ingests training data, that data becomes part of the LLM itself. With the right prompt, a user can pull training data back out, including personal information. And any time a large amount of data combines into a large data pool, a security breach becomes a risk.
LLM deployment privacy concerns
When a user sets an LLM on an activity, the LLM will also use data during deployment – a lot of data – potentially including personal data. This carries security concerns, such as the risk of a breach. There are also risks around access restrictions: if the LLM pulls data into itself on behalf of users with wide access privileges but then makes that data available to users who lack privileges for the data in question, the original access controls are effectively bypassed. One mitigation is sketched below.
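One practical mitigation is to enforce the user's permissions at retrieval time, before any document text reaches the model's prompt. The sketch below is a minimal illustration of that idea, assuming a hypothetical in-memory document store with per-document role lists; `Document` and `retrieve_context` are illustrative names, not any particular product's API.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    allowed_roles: set = field(default_factory=set)  # roles permitted to read this document

def retrieve_context(query: str, documents: list, user_roles: set) -> list:
    """Return only the documents the requesting user may see.

    Filtering happens *before* anything reaches the LLM's prompt, so the
    model never surfaces text the user could not access directly.
    """
    permitted = [d for d in documents if d.allowed_roles & user_roles]
    # A real system would also rank `permitted` by relevance to `query`.
    return [d.text for d in permitted]

# Example: an HR-only document stays invisible to an engineering user.
docs = [
    Document("Q3 salary bands", allowed_roles={"hr"}),
    Document("API style guide", allowed_roles={"engineering", "hr"}),
]
print(retrieve_context("style guide", docs, user_roles={"engineering"}))
# -> ['API style guide']
```

The key design choice is that the filtering happens outside the LLM: the model only ever sees documents the requesting user could already read, so nothing needs to be "un-learned" afterward.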
There is also the privacy question of how to handle individual rights. Where individuals have the right to access, delete, object, opt out, and/or correct, it is unclear how either the LLM developer or the deploying company can honor those rights. Even the scope of a deletion request is uncertain: an LLM consumes the data fed into it, and un-ringing that bell is challenging, if not impossible.
How to feed LLMs in a privacy-sensitive manner
Though it may not be possible to eliminate all privacy and security risks when training or using an LLM, there are some practical controls a responsible company can put in place to make the process significantly more privacy-sensitive.
- Define precise collection criteria – A responsible company will abide by the “minimum necessary” principle and carefully define which data types the LLM will collect, and from which types of sites it can collect them. For example, a company may want to prohibit an LLM from collecting sensitive data or special category data altogether. It may want to exclude social media sites as sources of information. A company may also want to exclude common identifier data points, such as names, email addresses, phone numbers, postal/physical addresses, social insurance numbers, and other identifiers (see the redaction sketch after this list).
- Use public or synthetic data – There are companies that offer synthetic data on which LLMs can train. Since these data sources contain no real personal data at all, privacy issues related to LLM training can be avoided altogether (see the synthetic-data sketch after this list). Additionally, government entities like the US Federal Government at data.gov, as well as private entities like Tableau, offer databases of public information.
- Define governance activities – Identify and document the governance activities needed to monitor data sources/inputs/outputs, train workers, and make decisions about appropriate LLM parameters, uses, and standards.
- Define security protocols – A company that deeply understands the particular security concerns related to AI will be able to establish security practices that address the right security risks. For example, a carefully crafted LLM deployment strategy can help reduce the risk of access privilege control breaches across the organization.
- Maintain transparency – Most regulatory guidance, including the European Union AI Act, includes some discussion of transparency. LLM training and deployment should include documentation that details data summaries, technical measures, reviews, and decisions related to privacy (as well as intellectual property, fairness, and other issues). Moreover, any LLM output should carry a clear label informing users that the output was AI-generated or AI-assisted.
- Consider privacy-preserving technology/techniques – Technology and data practices do not just present privacy challenges – they can help solve them. There are privacy-preserving technologies and analytics techniques that can help reduce privacy risk when training and deploying LLMs. Differential privacy, homomorphic encryption, secure multi-party computation, and secure enclaves are just a few methods to enhance data protection in AI applications, including LLMs (a differential privacy sketch follows this list).
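To make the collection-criteria point concrete, here is a minimal sketch of rule-based redaction applied to text before it is admitted to a training corpus. The regular expressions are illustrative assumptions only – a production pipeline would combine far more patterns with named-entity recognition to catch names and addresses.

```python
import re

# Illustrative patterns only; note that the name "Jane" in the example
# below survives redaction, which is why real pipelines also use NER.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace common identifier patterns with placeholder tokens
    before the text is admitted to a training corpus."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-123-4567."))
# -> Reach Jane at [EMAIL] or [PHONE].
```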
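For the synthetic-data option, open-source libraries such as the Python package Faker can generate structurally realistic records that belong to no real person. A short sketch (the fields chosen are arbitrary examples):

```python
from faker import Faker  # third-party package: pip install Faker

fake = Faker()
Faker.seed(42)  # seed for reproducible output during review

# Structurally realistic records that contain no real person's data.
synthetic_records = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "bio": fake.paragraph(nb_sentences=2),
    }
    for _ in range(3)
]

for record in synthetic_records:
    print(record)
```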
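Finally, as one example of a privacy-preserving technique, differential privacy adds calibrated random noise to query results so that the presence or absence of any single individual cannot be confidently inferred from the output. Here is a minimal sketch of the Laplace mechanism for a counting query, which has sensitivity 1 (adding or removing one person changes the count by at most 1); `private_count` and the epsilon values are illustrative.

```python
import numpy as np

def private_count(values, predicate, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1, so Laplace noise with
    scale 1/epsilon is enough to satisfy epsilon-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 29, 41, 57, 23, 38]
# Each call returns a noisy answer near the true count (4 here).
print(private_count(ages, lambda age: age >= 30, epsilon=0.5))
```

Smaller epsilon values add more noise, trading accuracy for stronger privacy.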
Final thoughts
Training and using an LLM in a privacy-sensitive way might seem like a climb up Mount Everest. Any privacy professional will initially cringe at the idea of feeding enormous sets of data to an insatiable LLM. However, there are some practical actions a responsible company can take to help.
The first step is to understand how LLMs work and the technologies and practices available to overlay privacy protections. It may be possible to avoid using personal data at all during training by using synthetic data, for example. Encryption and other techniques are also available to enhance privacy if personal data is required.
From there, a responsible company can insert a measure of human intervention. A set of carefully controlled (and documented) criteria for data consumption will go a long way toward reducing the amount and sensitivity of personal information involved. Additionally, through governance and transparency activities, a company can help ensure that human reviews and accountability occur at the right times to prevent, spotlight, and handle privacy concerns.